{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook reads the raw L2 VoterMapping file for California and saves the demographic\n",
    "information for Santa Clara County so that it can be used as training and test data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "library(tidyverse)\n",
    "library(testit) # assertions\n",
    "set.seed(123)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
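    "# Chunk callback: keep only the Santa Clara County rows of each chunk.\n",
    "# `pos` is part of the callback signature but is not used here.\n",
    "f <- function(x, pos) filter(x, `County` == \"SANTA CLARA\")\n",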
    "\n",
    "start <- proc.time()\n",
    "# Replace the filepath below with the location of the raw L2 TAB file.\n",
    "scc_data <- read_tsv_chunked(\n",
    "    \"~/Downloads/VM2--CA--2020-12-05/VM2--CA--2020-12-05-DEMOGRAPHIC.tab\",\n",
    "    DataFrameCallback$new(f),\n",
    "    chunk_size = 100000\n",
    ")\n",
    "time <- proc.time() - start\n",
    "time"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "write_csv(scc_data, \"../data/santa_clara_county_demographic_data.csv\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Reload the county extract, keep the fields needed downstream, and drop incomplete rows.\n",
    "scc_data <- read_csv(\"../data/santa_clara_county_demographic_data.csv\") %>%\n",
    "    select(\n",
    "        `Voters_FirstName`, `Voters_LastName`,\n",
    "        `Residence_Addresses_AddressLine`, `Residence_Addresses_Zip`,\n",
    "        `Voters_BirthDate`, `Languages_Description`\n",
    "    ) %>%\n",
    "    # Split the address at the first run of non-alphanumeric characters:\n",
    "    # street number first, the rest of the line second.\n",
    "    separate(`Residence_Addresses_AddressLine`, c(\"STREET_NUM\", \"ADDR\"), extra = \"merge\") %>%\n",
    "    rename(\n",
    "        `ZIP` = `Residence_Addresses_Zip`,\n",
    "        `first_name` = `Voters_FirstName`,\n",
    "        `last_name` = `Voters_LastName`,\n",
    "        `birthdate` = `Voters_BirthDate`,\n",
    "        `language` = `Languages_Description`\n",
    "    ) %>%\n",
    "    mutate(\n",
    "        STREET_NUM = as.numeric(gsub(\"[^0-9.-]\", \"\", `STREET_NUM`)),\n",
    "        ADDR = toupper(ADDR),\n",
    "        ZIP = as.character(ZIP),\n",
    "        birthdate = factor(birthdate)\n",
    "    ) %>%\n",
    "    # Drop rows with a missing value in any column, including STREET_NUM, ZIP, and ADDR.\n",
    "    na.omit()\n",
    "\n",
    "saveRDS(scc_data, \"../data/test_input.rds\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, geocode each voter record to obtain its census block group (CBG). Records that do not match a CBG are omitted. After applying your geocoding method of choice, the data frame should have\n",
    "a `GEOID` column that lists the CBG. Save the result as `../data/test_input_geocoded.rds`, which the scoring step reads."
   ]
  },
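  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough sanity check on the geocoded output, the sketch below drops unmatched records and confirms that every remaining `GEOID` looks like a CBG identifier. It assumes the geocoder returns `NA` for unmatched records and stores GEOIDs as 12-digit character strings (2-digit state, 3-digit county, 6-digit tract, 1-digit block group); the `geocoded` data frame and its values are hypothetical stand-ins for your geocoder's output.\n",
    "\n",
    "```r\n",
    "library(dplyr)\n",
    "\n",
    "# Hypothetical geocoder output: one matched record, one unmatched (GEOID is NA).\n",
    "geocoded <- tibble(\n",
    "    last_name = c(\"DOE\", \"ROE\"),\n",
    "    GEOID = c(\"060855001001\", NA)\n",
    ")\n",
    "\n",
    "# Omit voters that did not match to a CBG.\n",
    "geocoded <- filter(geocoded, !is.na(GEOID))\n",
    "\n",
    "# Every remaining GEOID should be a 12-digit string.\n",
    "stopifnot(all(grepl(\"^[0-9]{12}$\", geocoded$GEOID)))\n",
    "```\n",
    "\n",
    "Storing `GEOID` as character rather than numeric keeps the leading zero of California's state FIPS code (`06`)."
   ]
  },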
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "input_data_scored <- readRDS(\"../data/test_input_geocoded.rds\")\n",
    "# Get the age and scores for each feature.\n",
    "# Note: get_component_language_scores() is not defined in this notebook; load it before running this cell.\n",
    "input_data_scored <- get_component_language_scores(input_data_scored)\n",
    "\n",
    "# Assertion: scoring must not introduce missing values, and no input feature may be missing.\n",
    "assert(nrow(input_data_scored) == nrow(na.omit(input_data_scored)))\n",
    "\n",
    "saveRDS(input_data_scored, \"../data/test_input_scored.rds\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Define the train/test split: a random 80% of the scored records for training and the remaining 20% for testing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "input_data_scored <- readRDS(\"../data/test_input_scored.rds\")\n",
    "n <- nrow(input_data_scored)\n",
    "# seq_len() is safe when n is 0, and floor() makes the integer sample size explicit.\n",
    "train_index <- sample(seq_len(n), floor(0.8 * n))\n",
    "train_df <- input_data_scored[train_index,]\n",
    "test_df <- input_data_scored[setdiff(seq_len(n), train_index),]\n",
    "saveRDS(train_df, \"../data/test_input_train.rds\")\n",
    "saveRDS(test_df, \"../data/test_input_test.rds\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "R",
   "language": "R",
   "name": "ir"
  },
  "language_info": {
   "codemirror_mode": "r",
   "file_extension": ".r",
   "mimetype": "text/x-r-source",
   "name": "R",
   "pygments_lexer": "r",
   "version": "3.6.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
